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Abstract 

In this paper we address the problem of the prohibitively large computational cost of ex- 
isting Markov chain Monte Carlo methods for large-scale applications with high dimensional 
parameter spaces, e.g. in uncertainty quantification in porous media flow. We propose a new 
multilevel Metropolis-Hastings algorithm, and give an abstract, problem dependent theorem 
on the cost of the new multilevel estimator based on a set of simple, verifiable assumptions. 
For a typical model problem in subsurface flow, we then provide a detailed analysis of these 
assumptions and show significant gains over the standard Metropolis-Hastings estimator. Nu- 
merical experiments confirm the analysis and demonstrate the effectiveness of the method with 
consistent reductions of a factor of 0(10-50) in the e-cost of the multilevel estimator over the 
standard Metropolis-Hastings algorithm for tolerances e around 10~ 3 . 
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1 Introduction 

The parameters in mathematical models for many physical processes are often impossible to deter- 
mine fully or accurately, and are hence subject to uncertainty. It is of great importance to quantify 
the uncertainty in the model outputs based on the (uncertain) information that is available on the 
model inputs. A popular way to achieve this is stochastic modelling. Based on the available infor- 
mation, a probability distribution (the prior in the Bayesian framework) is assigned to the input 
parameters. If in addition, some dynamic data (or observations) related to the model outputs 
are available, it is possible to reduce the overall uncertainty and to get a better representation of 
the model by conditioning the prior distribution on this data (leading to the posterior). 

In most situations, however, the posterior distribution is intractable in the sense that exact 
sampling from it is unavailable. One way to circumvent this problem, is to generate samples using 
a Metropolis-Hastings type Markov chain Monte Carlo (MCMC) approach [211 125] I27j . which 
consists of two main steps: (i) given the previous sample, a new sample is generated according to 
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some proposal distribution, such as a random walk; (ii) the likelihood of this new sample (i.e. the 
model fit to -Fobs) is compared to the likelihood of the previous sample. Based on this comparison, 
the proposed sample is then either accepted and used for inference, or it is rejected and we use 
instead the previous sample again, leading to a Markov chain. A major problem with MCMC 
is the high cost of the likelihood calculation for large-scale applications, e.g. in subsurface flow 
where it involves the numerical solution of a partial differential equation (PDE) with highly varying 
coefficients on a - for accuracy reasons - very fine spatial grid. Due to the slow convergence of 
Monte Carlo averaging, the number of samples is also large and moreover, the likelihood has to be 
calculated not only for the samples that are eventually used for inference, but also for the samples 
that end up being rejected. Altogether, this leads to an often impossibly high overall complexity, 
particularly in the context of high-dimensional parameter spaces (typically needed in subsurface 
flow applications), where the acceptance rate of the algorithm can be very low. 

We show here how the computational cost of the standard Metropolis-Hastings algorithm can be 
reduced significantly by using a multilevel approach. This has already proved highly successful for 
subsurface flow problems in the context of standard Monte Carlo estimators based on independent 
and identically distributed (i.i.d.) samples H US El ED] . The multilevel Monte Carlo (MLMC) 
method was first introduced by Heinrich for the computation of high-dimensional, parameter- 
dependent integrals |22| , and then rediscovered by Giles |T7] in the context of infinite-dimensional 
integration in stochastic differential equations in finance. Similar ideas were also used by Brandt 
and his co-workers to accelerate statistical mechanics calculations [21 EJ- The basic ideas are to 
(i) exploit the linearity of expectation, (ii) introduce a hierarchy of computational models that are 
assumed to converge (as the model resolution is increased) to some limit model (e.g. the original 
PDE), and (iii) build estimators for differences of output quantities instead of estimators for the 
quantities themselves. In the context of PDEs with random coefficients, the multilevel estimators 
use a hierarchy of spatial grids and exploit that the numerical solution of a PDE on a coarser 
spatial grid, and thus the evaluation of the likelihood, is computationally much cheaper than on a 
fine grid. In that way each individual estimator will either have a smaller variance, since differences 
of output quantities from two consecutive models go to zero with increased model resolution, or 
it will require significantly less computational work per sample for low model resolutions. Either 
way the cost of each of the individual estimators is significantly reduced, easily compensating for 
the cost of having to compute L + 1 estimators instead of one, where L is the number of levels. 

However, the application of the multilevel approach in the context of MCMC is not straight- 
forward. The posterior distribution, which depends on the likelihood, has to be level-dependent, 
since otherwise the cost on all levels will be dominated by the evaluation of the likelihood on the 
finest level leading to no real cost reduction. Instead, and in order to avoid introducing extra bias 
in the estimator, we construct two parallel Markov chains {0™} n >o an d {@™_i} n >o on levels t and 
I — 1 each from the correct posterior distribution on the respective level. The coarser of the two 
chains is constructed using the standard Metropolis-Hastings algorithm, for example using a (pre- 
conditioned) random walk. The main innovation is a new proposal distribution for the finer of the 
two chains {#™}n>o- Although similar two- level sampling strategies have been investigated in other 
applications [131 Q3! > the computationally cheaper coarse models were only used to accelerate 
the MCMC sampling and not as a variance reduction technique in the estimator. Some ideas on 
how to obtain a multilevel version of the MCMC estimator can also be found in the recent work 
on sparse MCMC finite element methods. 



The central result of the paper is a complexity theorem (cf. Theorem 3.5) that quantifies, 
for an abstract large-scale inference problem, the gains in the e-cost of the multilevel Metropolis- 
Hastings algorithm over the standard version in terms of powers of the tolerance e. For a particular 
application in stationary, single phase subsurface flow (with a lognormal permeability prior with 



exponential covariance), we then verify all the assumptions in Theorem 3.5 We show that the 



2 



e-cost of our new multilevel version is indeed one order of e lower than its single-level counterpart 



(cf. Theorem 4.8), i.e. 0(e~( d+1 ^~ 5 ) instead of 0(e~( d+2 ^ 5 ), for any 5 > 0, where d is the spatial 
dimension of the problem. The numerical experiments for d = 2 in Section [5] confirm all these 
theoretical results. In fact, in practice it seems that the cost for the multilevel estimator grows 
only like 0(e~ d ), but this seems to be a pre-asymptotic effect. The absolute cost is about 0(10-50) 
times lower than for the standard estimator for values of e around 10 -3 , which is a vast improvement 
and brings the cost of the multilevel MCMC estimator down to a similar order than the cost of 
standard multilevel MC estimators based on i.i.d. samples. This provides real hope for practical 
applications of MCMC analyses in subsurface flow and for other large scale PDE applications. 

The outline of the rest of the paper is as follows. In Section [2j we recall, in a very general 
context, the Metropolis Hastings algorithm, together with results on its convergence. In Section 
[3j we then present a new multilevel version and give a general convergence analysis under certain, 
problem-dependent, but verifiable assumptions. A typical model problem arising in subsurface flow 
modelling is then presented in Section 4. We briefly describe the application of the new multilevel 
algorithm to this application, and give a rigorous convergence analysis and cost estimate of the 
new multilevel estimator by verifying the abstract assumptions from Section [3j Finally, in Section 
[5j we present some numerical results for the model problem discussed in Section |4j 

2 Standard Markov chain Monte Carlo 

We will start in this section with a review of the standard Metropolis Hastings algorithm, described 
in a general context. We denote by 9 := (9i)f =l the M^-valued random input vector to the model, 
and denote by X := (Xj)jL 1 = X(6) the M M -valued random output. Let further Qm,r = G(X) 
be some linear or non-linear functional of X. We shall often refer to M as the discretisation level 
of the model. 

We consider the setting where we have some real-world dynamic data (or observations) F ^, s 
available, and want to incorporate this information into our simulation in order to reduce the 
overall uncertainty. The data .Fobs usually corresponds to another linear or non-linear functional T 
of the model output. In the context of groundwater flow modelling, this could for example be the 
value of the pressure or the Darcy flux at or around a given point in the computational domain, 
or the outflow over parts of the boundary. 

Let us denote the conditional distribution of 6 given _F b s by tt m,r (9). We assume that as 
M,R — > oo, we have E t «,h [Qm,r — Q] — > for some (inaccessible) random variable Q. We are 
interested in estimating E^m.b [Q], for M, R sufficiently large. Hence, we compute approximations 
(or estimators) Qm,r of E^m.h [Qm,r\- To estimate this with a Monte Carlo type estimator, or 
in other words by a finite sample average, we need to generate samples from the conditional 
distribution -k m,r . Using Bayes' Theorem, we have 

^R {6 ) := v{e | Fobg) = ^ S \9)V(9) _ ^ 

l J F (-Fobs) 

Since the normalising constant VF(F bs) is n °t known in general, the conditional distribution 7r M ' R 
is generally intractable and exact sampling not available. 

For the remainder of the paper, we will refer to the conditional distribution tt m,r (9) as the 
posterior distribution, to £(F b s | 9) as the likelihood and to V(9) as the "prior distribution. The 
likelihood gives the probability of observing the data F Q b s given a particular value of 9, and usually 
involves computing the model response Fm,r '■= J 7 (X(9)) and comparing this to the observed data 
F b s . Note that since the model output depends on the discretisation parameter M, the likelihood 
and hence the posterior distribution 7r M ' R will in general also depend on M. As already mentioned, 
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ALGORITHM 1. (Metropolis Hastings MCMC) 

Choose 9°. For n > 0: 

• Given 9 n , generate a proposal 9' from a given proposal distribution q(9'\9 n ). 

• Accept 9' sample with probability 

a (9\9 )-mm|l^ M ^ n) (2.2) 
i.e. <9 n+1 = #' with probability a M ' fi and 6» n+1 = 6> n with probability 1 - a M ' fi . 



the posterior distribution ir M > R is usually intractable. In order to generate samples for inference, 
we will use the Metropolis Hastings MCMC algorithm in Algorithm 1. 

Algorithm 1 creates a Markov chain {9 n } n £?q, and the states 9 n are used in the usual way 
as samples for inference in a Monte Carlo sampler. The proposal distribution q(9'\9 n ) is what 
defines the algorithm. A common choice is a simple random walk. However, as outlined in |20j, 
the basic random walk does not lead to dimension R independent convergence, and a better choice 
is a preconditioned Crank-Nicholson (pCN) algorithm [IT]. Below we will see that it is also the 
crucial ingredient in our multilevel Metropolis-Hastings algorithm. When the proposal distribution 
is symmetric, i.e. when q(9 n \9') = q(9 > \9 n ), then the formula for a M,R (9'\9 n ) in (2.2) simplifies. 

Under reasonable assumptions, one can show that 9^ ~ tt m,r , as n — > oo, and that sample 
averages computed with these samples converge to expected values with respect to the desired 



target distribution tt m > r (see Theorem 2.2). The first several samples of the chain {6> n } ng N, say 
8°, . . . , 9 n °, are not usually used for inference, since the chain needs some time to get close to the 
target distribution ir ' R . This is referred to as the burn-in of the MCMC algorithm. Although 
the length of the burn-in is crucial for practical purposes, and largely influences the behaviour of 
the resulting MCMC estimator for finite sample sizes, statements about the asymptotics of the 
estimator are usually independent of the burn-in. We will therefore denote our MCMC estimator 
by 

-. N+n N+n 

Qn c ■= ^ E Qm, r = I, E e wn) , (2.3) 

n=no n=no 

for any no > 0, and only explicitly state the dependence on no where needed. 
2.1 Convergence analysis 

We will now give a brief overview of the convergence properties of the Metropolis-Hastings algo- 
rithm, which we will need below in the analysis of the multilevel variant. For more details we refer 
the reader, e.g., to [27]. Let 

K{9\8') := a M ' R {6\8') q{9\9') + (l - J a M ' R {9\9') q{9\9') d9'^j 5{9 - 9') 

denote the transition kernel of the Markov chain {9 n } n ^, with <5(-) the Dirac delta function, and 

£ = {9: tt M ' R (9) > 0}, 

V = {9: q{8\9') > for some 8' £ £}. 



4 



The set £ contains all parameter vectors which have a positive posterior probability, and is the 
set that Algorithm 1 should sample from. The set T>, on the other hand, consists of all samples 
which can be generated by the proposal distribution q, and hence contains the set that Algorithm 1 
will actually sample from. For the algorithm to fully explore the target distribution, we therefore 
crucially require £ C T>. The following results are classical, and can be found in |27j . 

Lemma 2.1. Provided £ C T>, ir M,R is a stationary distribution of the chain {6 n } n ^. 

Note that the condition £ C T> is sufficient for the transition kernel K(-\-) to satisfy the usual 
detailed balance condition K{6\9') tt m ' r (6') = K{6'\6) it m > r {6). 



Theorem 2.2. Suppose thatE n M,n [\Qm,r\] < oo and 

q[9\e') > 0, for all (6,6') G £ x £. 



(2.4) 



Then 



lim Q^ c 



E n M,R [Qm,r] , for any 6° € £ and uq > 0. 



The condition (2.4) is sufficient for the chain {6 n } n ^ to be irreducible, and it is satisfied 



for example for the random walk sampler or for the pCN algorithm (cf. [20]). Lemma 2.1 and 



Theorem |2.2| above ensure that asymptotically, sample averages computed with samples generated 
by Algorithm 1 converge to the desired expected value. In particular, we note that stationarity of 
{6 n } n( zfq is not required for Theorem 2.2, and the above convergence results hence hold true for 



any burn-in uq > 0, and for all initial values 6° £ £. 



Now that we have established the (asymptotic) convergence of the MCMC estimator (2.3), let 



us establish a bound on the cost of this estimator. We will quantify the accuracy of our estimator 
via the root mean square error (RMSE) 



E 



e 



1/2 



(2.5) 



where E© denotes the expected value not with respect to the target measure -k m,r , but with respect 
to the joint distribution of := {6 n } n ^ as generated by Algorithm 1. We denote by C e (Q^ c ) 
the computational e-cost of the estimator, that is the number of floating point operations that are 
needed to achieve a RMSE of e(Q^°) < e. 

Classically, the mean square error (MSE) can be written as the sum of the variance of the 
estimator and its bias squared, 



e(Q 



MC\2 _ v 
N ) ~ V© 



r>MC 



+ E 



[Q. 



Here, V© is again the variance with respect to the approximating measure generated by Algo- 
rithm 1. Using the triangle inequality and linearity of expectation, we can further write this as 



Q 



MC 
N 



+ 2(E© 



Q 



MC 
N 



nMC 



+ 2(E n M, R [Qm,r-Q]Y 



(2.6) 



The three terms in (2.6) correspond to the three sources of error in the MCMC estimator. The 



third (and last) term in (2.6) is the discretisation error due to approximating Q by Qm,r- The 



other two terms are the errors introduced by using an MCMC estimator for the expected value; 
the first term is the error due to using a finite sample average and the second term is due to the 
samples in the estimator not all being perfect (i.i.d.) samples from the target distribution tt m,r . 
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Let us first consider the two MCMC related error terms. Quantifying, or even bounding, the 
variance and bias of an MCMC estimator in terms of the number of samples N is not an easy task, 
and is in fact still a very active area of research. The main issue with bounding the variance is that 
the samples used in the MCMC estimator are not independent, which means that knowledge of the 
covariance structure is required in order to bound the variance of the estimator. Asymptotically, 



the behaviour of the MCMC related errors (i.e. Terms 1 and 2 on the right hand side of (2.6)) can 
be described using the following Central Limit Theorem, which can again be found in |27j . 

Let 6° ~ tt M ' R . Then the auxiliary chain := {9 n } n <=^ constructed by Algorithm 1 starting 
from 9° is stationary, i.e. 9 n ~ ^ M > R for all n > 0. Note that the covariance structure of is 
still implicitly defined by Algorithm 1 as for 0. However, now Vq[Qmr] = ^■k m ^[Qm,r] and 
E © [Qm,r\ = ^^MQm,r], for any n > 0, and 



Cov T 



MM -M.R 



Qm,R: Q 



M,R 



E. 



MM „M,R 



Q 





M,R 



[Q 



M,R_ 



Q 



M.R 



E % m,r [Qm,r] 



where Q n MR := Q{X{9 n )) and E n y[Z] = J RR J RR Z(9,9') dvr(0) dvr'(^), for a random variable Z 
that depends on 9 and 6'. We now define the so called asymptotic variance of the MCMC estimator 



Q 



M,R 



+ 2 ^ CoV ^MM^M.R 



n=l 



Q°M,Ri Qm,R 



Note that stationarity of the chain is assumed only in the definition of <Tq, i.e. for 0, and it is not 
necessary for the samples actually used in the computation of Q^ c . 

Theorem 2.3 (Central Limit Theorem). Suppose Oq < oo, (2.4) holds, and 

P [a M > R = 1] < 1. 

Then we have 

EjpM.-R [Q m ,r]) AAA(0,a^), 



(2.7) 



Q 



MC 
N 



where 



D 



denotes convergence in distribution. 



The condition (2.7) is sufficient for the chain to be aperiodic. It is difficult to prove theoreti- 



cally. In practice, however, this condition is always satisfied, since not all proposals in Algorithm 1 
will agree with the observed data and thus be accepted. 



Theorem 2.3 holds again for any burn-in uq > and any starting value 9° G £■ It shows that 
asymptotically, the sampling error of the MCMC estimator decays at a similar rate to the sampling 
error of an estimator based on i.i.d. samples. Note that this includes both sampling errors, and so 
the constant a% is in general larger than in the i.i.d. case where it is simply Y n M,R [Qm,r\- 

Since we are interested in a bound on the MSE of our MCMC estimator for a fixed number of 
samples N, we make the following assumption: 



Al. For any JVeN, 



Q 



MC 
N 



+ E© 



Q 



MC 
N 



E, 



■qMC 



2 y^mAQm,r] 



< 



N 



{21 



with a constant that is independent of M, N and R. 



Non-asymptotic bounds such as in Assumption Al are difficult to obtain, but have recently 
been proved for certain Metropolis-Hastings algorithms, see e.g. [20l |28l |23] . These results require 
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that the chain is sufficiently burnt-in. The hidden constant usually depends on quantities such as 
the covariances appearing in the asymptotic variance Oq. 



To complete the error analysis, let us now consider the last term in the MSE (2.6), the dis- 
cretisation bias. As before, we assume E x m,a [Qm,r — Q] — > as M, R — y oo, and we furthermore 
assume that we have a certain order of convergence, i.e. 

\Km,r [Qm,r ~Q]\< M~ a + R~ a \ (2.9) 
for some a, a' > 0. The rates a and a' will be problem dependent. Let now R = M a l a , such that 



the two error contributions in (2.9) are balanced. Then it follows from (2.6), (2.8) and (2.9) that 



the MSE of the MCMC estimator can be bounded by 

e(^ C ) 2 < V ^ 9M ' fl] +M-». (2.10) 

Under the assumption that Y n M,R [Qm,r] ~ constant, independent of M and R, it is hence sufficient 
to choose N > e~ 2 and M > e" 1 /" to get a RMSE of 0(e). 

Let us now give a bound on the computational cost to achieve this error, the so called e-cost. 
For this, assume that the cost to compute one sample QIj R satisfies < M 7 , for some 

7 > 0. Thus, with N > e~ 2 and M > e~ 1/a , the e-cost of our MCMC estimator can be bounded 
by 

C £ (Q^ c )<NM^<e- 2 -'/ a . (2.11) 

In practical applications, especially in subsurface flow, both the discretisation parameter M and 
the length of the input random vector R usually need to be very large in order for E^m.h [Qm.r] to 
be a good approximation to E^oo [Q]. Moreover, from the analysis above, we see that we need to use 
a large number of samples N in order to get an accurate MCMC estimator with a small MSE. Since 
each sample requires the evaluation of the likelihood C(F Q \ }S \9 n ), and this is very expensive when 



M and R are very large, the standard MCMC estimator (2.3) is often extraordinarily expensive in 
practical situations. Additionally, the acceptance rate of the algorithm can be very low when R is 
very large. This means that the covariance between the different samples will decay more slowly, 
which again makes the hidden constant in Assumption Al larger, and the number of samples we 
have to take in order to get a certain accuracy increases even further. 



To overcome the prohibitively large computational cost of the standard MCMC estimator (2.3) 
we will now introduce a new multilevel version of the estimator. 



3 Multilevel Markov chain Monte Carlo algorithm 

The main idea of multilevel Monte Carlo (MLMC) simulation is very simple. We sample not just 
from one approximation Qm,R of Q, but from several. Let us recall the main ideas from |17| 19"]. 

Let {Mg : i = 0, . . . , L] be an increasing sequence in N, i.e. Mq < Mi < . . . < Ml =: M, and 
assume for simplicity that there exists an s £ N\{1} such that 

M/ = «M/_i, for alU= 1,...,L. (3.1) 

We also choose a (not necessarily strictly) increasing sequence {Re}f =0 C N, i.e. Rg > Rt-\, for 
all t = 1, . . . ,L. For each level £, denote correspondingly the parameter vector by 6e S M. Re , the 
quantity of interest by Qg := Qu t ,R t and the posterior distribution by it 1 := n M e> R i. 

As for multigrid methods applied to discretised (deterministic) PDEs, the key is to avoid es- 
timating the expected value of Qg directly on level £, but instead to estimate the correction with 
respect to the next lower level. Since in the context of MCMC simulations, the target distribution 
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7T^ depends on £, the new multilevel MCMC (MLMCMC) estimator has to be defined carefully. We 
will use the identity 

L 

E„l[Q l ] = E T o[Q ] + J^Kty-ilQt-Qe-i] (3-2) 
basis, where by the linearity of expectation 



E^-i [Q/-Q 



e-i 



K*[Qt 



d^-n^-i; 



[Qi 



(3.3) 



The idea of the multilevel estimator is now to estimate each of the terms on the right hand 



side of (3.2) independently, in a way that minimises the variance of the estimator for a fixed 



computational cost. In particular, we will estimate each term in (3.2 ) by an MCMC estimator. The 



first term E^o [Qq] can be estimated using the standard MCMC estimator described in Algorithm 1, 
i.e. Qqjv as in (2.3) with Nq samples. We need to be more careful in estimating the differences 
K n i n i-i[Qi — Qe-i], and build an effective two-level version of Algorithm 1. For every £ > 1, we 
denote Ye := Qe — Qe-i and define the estimator on level £ as 



yMC 



1 



n*+N e 

£ 

£ 

n=nQ 



Y, 



(n) 



1 



n l +N e 

E 



Qe 



-lh 



where rig again denotes the burn-in of the estimator and Ng is the number of samples on level £ 
The main ingredient in this two level estimator is a judicious choice of the two input vectors 91 



and (see Section 3.1). The full MLMCMC estimator is now defined as 



^L,{N e } 



, Vv MC 



(3.4) 



i=i 



where it is important (i) that all the chains that are used to produce the L + l estimators in (3.4) 
are independent, and (ii) that the two chains {6*™} ne N and {©"} n eN> that are used in Y^P and 
in Yi^ijf t l respectively, are drawn from the same posterior distribution ir e , so that QY{N e \ ^ s an 
unbiased estimator of E^z-fQr]. 

There are two main ideas in \17\ [9] underlying the reduction in computational cost associated 
with the multilevel estimator. Firstly, samples of Qe, for £ < L, are cheaper to compute than 
samples of Ql, reducing the cost of the estimators on the coarser levels for any fixed number of 
samples. Secondly, if n e-i [Yt] — > as £ — > oo, we need only a small number of samples to obtain 
a sufficiently accurate estimate of ^e-i [Ye] on the fine grids, and so the computational effort on 
the fine grids is also greatly reduced. Here, 



{Y e -E n ey- 1 [Y £ })' 



(3.5) 



where the expectation E^ n e-i is as in (3.3). 



By using the telescoping sum (3.2) and by sampling from the posterior distribution ir on level 
we ensure that a sample of Qe, for £ < L, is indeed cheaper to compute than a sample of Ql- It 



remains to ensure that V„ 



[Ye] -> as £ — > oo. 
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ALGORITHM 2. (Metropolis Hastings MCMC for Q £ - Q t _{) 
Choose initial states &£_i and 8® := [0°_ x , 8® F ]. For n > 0: 

• On level £ — 1: Given generate ©"J^ 1 using Algorithm 1 with some proposal 
distribution q i,c | and acceptance probability 

W fn , in" N ■ L ^-H& t -lW' C (eU\&l-l) \ 

° ( e «l e ^) = mn I 1 ' ^-1(9^)^(6^16^) J • 

• On level Given 0" generate using Algorithm 1 with the specific proposal distri- 
bution q i (8' e | 0") induced by taking 8' iC := ©"^-j 1 and by generating a proposal for 
from some proposal distribution q ,F {0'ep \ &fp)- The acceptance probability is 



aHe'AOf) = min 



3.1 The estimator for — Qi-\ 

Let us for the moment fix 1 < £ < L. The challenge is now to generate the chains } n eN an d 
{0"_i}neN such that V n e ^t-i [Yg] is small. To this end, we partition the input vector 8g into two 
parts: the entries which are present already on level l — \ (the "coarse" modes), and the new entries 
on level £ (the "fine" modes): 

6 1 = [@t,c > ®1,f\ 

where 8gc has length Rg-i, i.e. the same length as The vector 0g t p has length Ri — Rg-i. 

An easy way to construct 6™ and Q^-i such that ^-i [Yg] is small, would be to generate 8™ 
first, and then simply use = @ec m H° wever J since we require @£_i to come from a Markov 

chain with stationary distribution ir i ~ 1 , and comes from the distribution tt , this approach is 
not permissible. We will, however, use this general idea in Algorithm 2. 

The coarse sample ©^l^ 1 is generated using the standard MCMC algorithm given in Algorithm 1, 
using, e.g., a random walk or the pCN proposal distribution [11] for q e ' C . Based on the outcome on 
level £ — 1, we then generate 0™ +1 , using a new two-level proposal distribution in conjunction with 
the usual accept/reject step from Algorithm 1. The proposal distribution q^' F for the fine modes 
in that step can again be via a simple random walk or the pCN algorithm. 

At each step in Algorithm 2, there are four different outcomes, depending on whether we accept 
on both, one or none of the levels. The different possibilities are given in Table [T] Observe that 
when we accept on level £, we always have O™^ 1 = O™^ 1 , i-e. the coarse modes are the same. 
If, on the other hand, we reject on level £, we crucially return to the previous state 8™ on that 
level, which means that the coarse modes of the two states may differ. They will definitely differ 
if we accept on level £ — 1 and reject on level £. If both proposals are rejected then it depends on 
the decision made at the previous state whether the coarse modes differ or not. In general, this 
"divergence" of the coarse modes may mean that Y w e does not go to as £ — > oo for a 

particular application. But provided the modes are ordered according to their relative "influence" 
on the likelihood C(F ^ S \8), we can guarantee that a e (8' e \8™) — > 1 and thus that ^-i^] — > 
as £ — > oo. We will show this for a subsurface flow application in Section [4} 
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Level t — 1 test 


Level £ test 


t — i 


on+l 
" e n 


reject 


accept 






accept 


accept 






reject 


reject 




an 
U i,C 


accept 


reject 




an 



Table 1: Possible states of and fl™^ 1 in Algorithm 2. 

The specific proposal distribution q e in Algorithm 2 can be computed very easily and at no 
additional cost, leading to a simple formula for the "two- level" acceptance probability or. 

Lemma 3.1. Let £ > 1. Then 



^{e' f )^- l {ei c )g^ (0i 

l ^^)^(9' c )q- ' 



a t (9' l \9 7 l) = min 

I vi ~\y t ) n ~ 

If we further suppose that the proposal distributions q ' and q e,F are symmetric, then 




to lol mn , • L ^ & t-i) \ , Hams ■ L Aod^-Heic)] 

^ C (©,- 1 l©"- 1 ) = mm|l, ^ 1(e ^ ) | and a\9 t | g?) = mm |l, ^ j . 

Proof. Let 0" and be any two admissible states on level £ Since the proposals for the coarse 
modes 0^c an d for the fine modes 0£ t p are generated independently, the transition probability 
q i {0\ \0f) can be written as a product of transition probabilities on the two parts of 0g. For the 
coarse level transition probability, we have to take into account the decision that was made on level 
£ — 1. Hence, 

<f{fi\q) = a i ' c (0l c \0l c ) q e > C (9l c \9l c ) q l > F {0ip\0ip). (3.6) 

and so 



min < 








9 lc) 1 




olc)<f> F m >F \ 








9 tc) J 


. qt> C (0l c \ 


9 £,f) 


min < 




tH 




°lc) 1 


> q^(9\ c \ 


9lc)^ F {0{ F \ 


9 a ) 


7t' 


-H0e,c)g l - c (elc\ 


9 ?,c) J 


U i,F> 



mm < i, 7r £-i(^ G ) ? «,c ( ^ c | e a c) J 9 K°i,c\ v t,c)9 K°£, F \ £,f) 

This completes the proof of the first result, if we choose 0% := 0™ and 0$ := Q' t The corollary for 
symmetric distributions q^' and follows by definition. □ 

Remark 3.2 (Recursive algorithm). Note that one particular choice for the coarse level proposal 
distribution in Step 1 of Algorithm 2 on each of the levels I > 1 is q ' := q i_1 , i.e. the "two-level" 
proposal distribution defined in Step 2 of Algorithm 2 on level £ — 1. We can apply this strategy 
recursively on every level and set q° to be, e.g., the pCN algorithm [11]. So proposals for Qg-\ 
and for Qn get "pre-screened" at all coarser levels, starting always at level 0. The formula for 
the acceptance probability or in Lemma 3.1 does not depend on q^' C and so it remains the same. 



However, this choice did not prove advantageous in practice. It requires £ + 1 evaluations of the 
likelihood on level £ instead of two and it does not improve the acceptance probability. Instead, 
we found that choosing the pCN algorithm for q e,c (as well as for cf ,F ) worked better. 

A simplified version of Algorithm 2, making use of the symmetry of the pCN proposal distri- 
bution and of the formulae derived in Lemma 3.1 is given in Section [5] and will be used for the 
numerical computations. 
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3.2 Convergence analysis 



Let us now move on to convergence properties of the multilevel estimator. As in Section 2.1 let 
K t (e t | 9\) := a\9 e \ 9' e ) q e (9 e \ ff t ) + f 1 - / a l {9 t \ 6' e ) q e (9 £ \ 9' e ) d9' f ) 5(9 e - 9' e ), 

\ JR R e J 
denote the transition kernel of {#™} n eN) and define, for all I = 0, . . . , L, the sets 

£ e = {9 £ : ir\9 e ) > 0}, 

V 1 = {9 e : q e (9 e \9' e )>0 for some 9\ G £ 1 }. 

The following convergence results follow from the classical results, due to the telescoping sum 
property (3.2) and the algebra of limits. 

Lemma 3.3. Provided £ l C T> , 7T^ is a stationary distribution of the chain {$^}n£N* 
Theorem 3.4. Suppose that for all i = 0, . . . , L, E n e [\Qe\] < 00 and 

q £ (9 £ | 9' e ) > 0, for all (9 e , 9' e ) G E l x £ e . (3.7) 

T/ien 

lim Q^ Ny =E^L[Q L ], for any 9° e G £ e and n e >0. 



Let us have a closer look at the irreducibility condition <\3.7\). As in (3.6), we have 



q t {9 l %) = a e ' C (9 etC \9' i<c )q i ' C (9 e ,c\9' ejC )q e ' F (9 £iF \e' e 



F) 



and thus Q holds, if and only if, for all {9 e ,9' e ) G ^xS 1 , 7r e - 1 (9 £}C ), q e ' C (9' ec \9^ c ), q^ c {9^ c \9' ifi ) 
and q e,F {9g t F\@e f) are an positive. The final three terms are positive for common choices of proposal 
distributions, such as the random walk sampler or the pCN algorithm. The first term can also be 
assured to be positive by appropriate choices for the likelihood and prior distributions. 

We finish the abstract discussion of the new, hierarchical multilevel Metropolis-Hastings MCMC 
algorithm with the main theorem that establishes a bound on the e-cost of the multilevel estimator 
under certain assumptions on the MCMC error, on the (weak) model error, on the strong error 
between the states on level i and on level i — 1 (in the two- level estimator for Yg), as well as on the 
cost Ci to advance Algorithm 2 by one state from n to n + 1 (i.e. one evaluation of the likelihood 
on level i and one on level t — 1). As in the case of the standard MCMC estimator, this bound 
is obtained by quantifying and balancing the decay of the bias and the sampling errors of the 
estimator. To state our assumption on the MCMC error and to define the mean square error of 
the estimator we define &£ := {#"} n eN U {@™_i} n <=N, for £ > 1, and ©o := {^ol^eN • 

Theorem 3.5. Let e < exp[— 1] and suppose there are positive constants a, a', (3, /3', 7 > such 
that a > \ min(/3, 7) and Rp > ^./™ ax W a filfi } _ \j n d er th e following assumptions, 

Ml. \E^[Q t -Q]\ < {M- a + R- a ') 

M2. W nW -i[Y e ] < M^+Rjf, 

M3. V e #^] + (Ee £ [^-Vw*-i[^]) a < JVT'V^-iK] 

M4. C e < M7, 
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there exists a number of levels L and a sequence {Ne}f =Q such that 



E, 



U e & e 



Q 



ML 



L,{N e } 



< £ 



and 



C £ (Q ML 



L,{N e } 



< 



--2 

e~ 2 | loge| 3 , 

£ -2-(y-$)/a | log£ | 



if P>1, 
if = 7, 
if P<1- 



Proof. The proof of this theorem is very similar to the proof of the complexity theorem in the 
case of multilevel estimators based on i.i.d samples (cf. 0, Theorem 1]), which can be found in the 
appendix of [9]. First note that by assumption we have RJ a < MJ a and Rg ^ < M £ 13 . 
Furthermore, similar to (2.6), we can expand 



ML \2 < v 



U £ 0£ 



L,{N t } 



+ 2(E u , 0f 



Q ML 



Since the second term in the MSE above can be bounded by 



Q 



ML 

L,{N e } 



E, 



Q 



ML 

L,{N t } 



L 

E 



E 



<(L + 1 



L 

E 

l=i 



vMC 



E e , 



E„ 



2 + 2(E nL [Q L -Q] 2 



MC- 



yMC 

*e,N e 



E„ 



where we have set Yq := Qo and E^o ^-i [Ej^] := E^ofQj^J, it follows from Assumption M3 that 



2 



(3.8) 



£=0 



In contrast to the MSE for multilevel estimators based on i.i.d samples, we hence have a factor 



(L + 1) multiplying the sampling error term on the right hand side of (3.8). This implies that in 
order to make this term less than e 2 /2, the number of samples Nt needs to be increased by a factor 
of (L + 1) compared to the i.i.d. case. The cost of the multilevel estimator is correspondingly also 
increased by a factor of (L + 1). The remainder of the proof remains identical. 



Since L is chosen such that the second term in (3.8) (the bias of the multilevel estimator) is 
less than e 2 /2, it follows from Assumption Ml that L + 1 < | loge|. The bounds on the e-cost then 
follow as in [91 Theorem 1], but with an extra | loge| factor. □ 

Assumptions Ml and M4 are the same assumptions as in the single level case, and are related 
to the bias in the model (e.g. due to discretisation) and to the cost per sample, respectively. 
Assumption M3 is similar to assumption Al, in that it is a non-asymptotic bound for the sampling 
errors of the MCMC estimator Ygjp. For this assumption to hold, it is in general necessary that 
the chains have been sufficiently burnt in, i.e. that the values ng are sufficiently large. 



4 Model Problem 



In this section, we will apply the proposed MLMCMC algorithm to a simple model problem aris- 
ing in subsurface flow modelling. Probabilistic uncertainty quantification in subsurface flow is of 
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interest in a number of situations, as for example in risk analysis for radioactive waste disposal or 
in oil reservoir simulation. The classical equations governing (steady state) single phase subsurface 
flow consist of Darcy's law coupled with an incompressibility condition (see e.g. [12] 110]): 

w + kVp = g and div w = 0, in D C R d , d=l,2, 3, (4.1) 

subject to suitable boundary conditions. In physical terms, p denotes the pressure head of the 
fluid, k is the permeability tensor, w is the filtration velocity (or Darcy flux) and g are the source 
terms. 

A typical approach to quantify uncertainty in p and w is to model the permeability as a random 
field k = k(x, w) onDx $7, for some probability space (fi, A, P). The mean and covariance structure 



of k has to be inferred from the (limited) geological information available. This means that (4.1) 



becomes a system of PDEs with random coefficients, which can be written in second order form as 

-V ■ {k(x,uj)Vp(x,u)) = f(x), in D, (4.2) 

with / := —div g. This means that the solution p itself will also be a random field onD x Q. For 
simplicity, we shall restrict ourselves to Dirichlet conditions p(u),x) = po(x) on dD, and assume 
that the boundary conditions po and the sources g are known (and thus deterministic). 



In this general form solving (4.2) is extremely challenging computationally and so in practice it 
is common to use relatively simple models for k that are as faithful as possible to the measurements. 
One model that has been studied extensively is a log-normal distribution for k, i.e. replacing the 
permeability tensor by a scalar valued field whose log is Gaussian. It guarantees that k > almost 
surely (a.s.) in Q, and it allows the permeability to vary over many orders of magnitude, which is 
typical in subsurface flow. 

When modelling a whole aquifer, a whole oil reservoir, or a sufficiently large region around a 
potential radioactive waste repository, the correlation length scale for k is typically significantly 
smaller than the size of the computational region. In addition, typical sedimentation processes 
lead to fairly irregular structures and pore networks, and faithful models should therefore also 
only assume limited spatial regularity of k. A covariance function that has been proposed in the 
application literature (cf. [23]) is the following exponential two-point covariance function for log A:: 

C(x,y) := cT 2 exp(- l|:E ~ y||r ) , x,y £ D, (4.3) 

where || • || r denotes the £ r -norm in R rf and typically r = 1 or 2. The parameters a 1 and A denote 
variance and correlation length, respectively. In subsurface flow applications typically only a 2 > 1 
and A < diam D will be of interest. This choice of covariance function implies that k is homogeneous 
and it follows from Kolmogorov 's theorem [26] that k{-,u) £ C°''(£>) a.s., for any £ < 1/2. 

For the purpose of this paper, we will for the remainder of this section assume that A; is a 



log- normal random field, where log/c has mean zero and exponential covariance function (4.3) with 
r = 1. However, other models for k are possible, and the required theoretical results can be found 
in [61 [301 [29]. 



Let us briefly put model problem (4.2) into context for the MCMC and MLMCMC methods 
described in sections [2] and [3j The quantity of interest Q is in this case some functional Q of 
the solution p, and Qm,r is the same functional Q evaluated at a discretised solution pm,r- The 
discretisation level M denotes the number of degrees of freedom (e.g. grid nodes for standard 



piecewise linear finite elements) for the numerical solution of (4.2) for a given sample, and the 
parameter R denotes the number of random variables used to model the permeability k. The 
random vector Xm,r will contain approximate values of the pressure p at M given points in the 
spatial domain D. 
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In order to apply the proposed MCMC methods to model problem (4.2), we hence need to 



represent the permeability k in terms of a vector 6 of random variables. For this, we will use the 
Karhunen-Loeve (KL-) expansion. For the Gaussian field log k, this is an expansion in terms of a 
countable set of independent, standard Gaussian random variables {£ n }neN- It * s given by 



log k(uj,x) 



oo 

£ 

n=l 



//S^> n (x)£ n (w) 



where {/i n }neN are the eigenvalues and {0 n }„ e N the corresponding L 2 -normalised eigenfunctions 
of the covariance operator with kernel function C{x,y). For more details on its derivation and 
properties, see e.g. [16]. We will here only mention that the eigenvalues {n n }neN are all non- 
negative with XmX)/- 4 " < +oo. For the particular covariance function (4.3) with r = 1, we have 



fi n < n and hence there is an intrinsic ordering of importance in the KL-expansion. 

Truncating the KL-expansion after a finite number R of terms gives an approximation of k in 
terms of R standard normal random variables, 



k R (cv,x) = exp 



The coefficients {£n}n=i wm be our input random vector 6 in the MCMC algorithms. To achieve a 
level-dependent representation of k, we simply truncate the KL-expansion after a sufficiently large, 
level-dependent number of terms Ri, such that the truncation error on each level is bounded by 
the discretisation error, and set 9g := {£n}n=i- 

For the spatial discretisation of model problem ( |4.2[ ), we will use standard, continuous, piecewise 
linear finite elements (see e.g. [U [8] for more details). Other spatial discretisation schemes are 
possible, see for example [9] for a numerical study with finite volume methods and |19j for a 
theoretical treatment of mixed finite elements. We choose a regular triangulation Th of mesh width 
h of our spatial domain D, which results in M = 0(h~ d ) degrees of freedom for the numerical 



approximation. A sequence of discretisation levels Mi satisfying (3.1) can then be constructed by 
choosing a coarsest mesh width ho, and choosing hi := s~^ho. A common (but not necessarily 
the optimal) choice is s = 2 and uniform refinement between the levels. We will denote the finite 
element solution on level £ by pi. 

Let us finally specify the prior distribution and likelihood model that we will assume for the 
remainder of this paper. The prior distribution Vi of 6i is simply a standard ^-dimensional 
Gaussian: 



1 



(2vr)^/ 2 



exp 



2 

7=1 



(4.4) 



For the likelihood we also choose a normal distribution, centred around the model response F i (8i) 
F(pi{0i)) and with variance <Jpf 



A (F b s I Oi) oc exp 



obs 



24, 



(4.5) 



Recall that the coarser levels in our multilevel estimator are introduced only to accelerate the 
convergence and that the multilevel estimator is still an unbiased estimator of the expected value 
of Ql with respect to the posterior ir L on the finest level L. Hence, the posterior distributions on 
the coarser levels 7r , £ = 0, . . . , L — 1, do not have to model the measured data as faithfully as 
tt l . In particular, this means that we can choose larger values of the fidelity parameter Opo on the 
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coarse levels, which will increase the acceptance probability on the coarser levels, since it is easier 
to match the model response i* 1 (0^) with the data -Fobs- As we will see below (cf. Assumption A3), 
the growth in a 2 Fi has to be controlled. 



4.1 Convergence analysis 

We now perform a rigorous convergence analysis of the MLMCMC estimator Q^\ni} introduced 



in Section [3] to the described model problem (4.1). Using Theorem 3.4, we will first verify that 



indeed this multilevel estimator is an unbiased estimator of E^l [Q l] , before we go on to quantify 
its computational cost by verifying the assumptions of Theorem |3.5| 

To conclude that the multilevel estimator converges to the correct expected value E^l [Q l] as 



the number of samples tends to infinity, we only need to verify the irreducibility condition (3.7) 
in Theorem |3.4[ As already noted in Section [3j for common choices of proposal distribution, the 
condition holds true if we have 7f (0^c) > for all 6? s.t. n (0t) > 0. Since both the prior and 
the likelihood were chosen as normal distributions, and normal distributions have infinite support, 
the conclusion then follows. 

Theorem 4.1. Suppose that for all £ = 0, . . . , L, E^ [\Qe\] < 00. Then 

lim Q^\ NA = E nL [Q L ] , for any 6% G E l and n e > 0. 

{A^}->oo ' x " 

Let us now move on to quantifying the cost of the multilevel estimator, and verify that the 
assumptions in Theorem |3 . 5| hold for our model problem. We will prove Ml and M2. As mentioned 
earlier, assumption M3 involves bounding the mean square error of an MCMC estimator, and a 
proof of M3 is beyond the scope of this paper. Results of this kind can be found in e.g. [281 12"0] - 
We will also not address M4, which is an assumption on the cost of obtaining one sample of Qg. In 
the best case, with an optimal linear solver to solve the discretised (finite element) equations for 
each sample, M4 is satisfied with 7 = 1. 

Since they will become useful later, let us recall some of the main results in the convergence 
analysis of multilevel Monte Carlo estimators based on independent and identically distributed 
(i.i.d.) samples, rather than samples generated by Algorithm 2. An extensive convergence analysis 



of finite element multilevel estimators based on i.i.d. samples for model problem (4.1) with log- 
normal coefficients can be found in [6, 30, 29]. We firstly have the following result on the convergence 
of the finite element error in the natural i? 1 -norm. 

Theorem 4.2. Let g be a Gaussian field with constant mean and covariance function (|4.3|) with r = 



1, and let k = exp[g] in model problem (4.2). Suppose D C R is Lipschitz polygonal (polyhedral). 
Then 



E- 



Pt 



\p-Pe\ q m {D ) 



1/q <C k , f , P0 , q (Mi 1/2d+6 + RJ 1/2+S ) 



for any q < 00 and 5 > 0, where the (generic) constant Cfcj iP0i9 (here and below) depends on the 
data k, f, po and on q, but is independent of any other parameters. 

Proof. This follows from [30^ Proposition 4.1]. □ 



Convergence results for functionals Q of the solution p can now be derived from Theorem 4.2 
using a duality argument. We will here for simplicity only consider bounded, linear functionals, 
but the results can easily be extended to any continuously Frechet differentiable functional (see 
|3U[ §3.2]). We make the following assumption on the functional Q (cf. Assumption Fl in |30j). 
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A2. For given u £ 0, let Q : H l (D) 
Cg £ L q (Q), such that 



be linear, and suppose that, for any q < oo, there exists 



G(v)\<Cg(u)\\v\ 



for all 5 > 0. 



An example of a functional which satisfies A2 is a local average of the pressure, J D ^pdx for some 
D* C D. The main result on the convergence for functionals is the following. 



Lemma 4.3. Let the assumptions of Theorem 4-2 be satisfied, and suppose Q satisfies A2. Then 



E Ve [\g( P )-g(p e )\^ <c k , f!P0>q (m, 



l/d+8 D -l/2+<5 

t + K e 



for any q < oo and 5 > 0. 

Proof. This follows from [30^ Corollary 4.1]. 



□ 



Note that assumption A2 is crucial in order to get the faster convergence rates of the spatial 

For multilevel estimators based on i.i.d. samples, it follows 



discretisation error in Lemma 4.3 



immediately from Lemma 4.3 that the (corresponding) assumptions Ml and M2 are satisfied, with 
a = l/d + 5, a' = 1/2 + 5 and /3 = 2a, P' = 2a', for any 5 > (see [30] for details). 

The aim is now to generalise the result in Lemma 4.2 to include the framework of the new 
MLMCMC estimator. There are two issues which need to be addressed. Firstly, the bounds in 



assumptions Ml and M2 in Theorem 3.5 involve moments with respect to the posterior distributions 



7r , which are not known explicitly, but are related to the prior distributions Ve through Bayes' 
Theorem. Secondly, the samples which are used to compute the differences Q" — Q™_ 1 are generated 
by Algorithm 2, and may differ not only due to the truncation order, but also because they come 
from different Markov chains (i.e. 0™_ x is not necessarily equal to OgQ, as seen in Table fll). 

To circumvent the problem of the intractability of the posterior distribution, we nave the 
following lemma, which relates moments with respect to the posterior distribution 7r^ to moments 
with respect to the prior distribution Vt . 

Lemma 4.4. For any random variable Z = Z(0g) and for any q s.t. Ep £ \\Z\ q ] < oo, we have 

\E^[Z«} \ < EpjZ|*]. 

Similarly, for any random variable Z = Z(9e, and for any q s.t. K-p^-p t _ 1 [\Z\ g ] < oo, we have 



Proof. Using Bayes' Theorem (2.1), we have 
Jr r i /-M-^obsJ 



< 



sup e , [A (-Fobs I 0e)] 



V F {F t 



obs I 



\Z(9 e )\ q V £ (6e) 



Since the likelihood £,£ is not a discrete probability measure, so that sup 6f [Ce(F b s \ 0e)] < oo, the 
first claim of the Lemma then follows, since VF(F bs) is a constant. The second claim of the Lemma 
can be proved analogously. □ 



Note that it follows immediately from Lemmas |4.3| and |4.4 | and the linearity of expectation that 

5 and a' 



assumption Ml in Theorem 3.5 is satisfied, with a = l/d 



1/2 



5, for any 5 > 0. In 

order to prove M2, we further have to analyse the situation where the two samples 0" and 
used to compute "diverge", i.e. when S"_ 1 / 9^ c . 

We need to make the following two assumptions on the parameters Op , in the likelihood model 



(4.5) and on the growth of the dimension R £ . 
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A3. The dimension Ri — > oo as E — > oo and 

Ri-Ri 



{Re - Rt-i)(2Tc)- 



< rf-1/2+6 
~ K l-1 



A4. The sequence of fidelity parameters {o~ 2 Fe } c ^L satisfies 



-l ~ 



-1/2+5 
-1 ' 



-l/d+8 
L l-1 



for all 5 > 0. 



for all <5 > 0. 



For A3 to be satisfied it suffices that Rg — Re~i grows logarithmically with Ri-\. Assumption A4 
holds for example, if we choose the fidelity parameter to be constant for all E > Eq, for some Eq > 0. 

Under these assumptions we can now prove that assumption M2 in Theorem 3.5 is satisfied, 
with = 1/d- 8 and /3' = 1/2 - 5, for any 5 > 0. 

Lemma 4.5. For n G N, let 8™ and be the nth states of the Markov chains generated by 

Algorithm 2. Let the assumptions of Theorem \4-S\ as well as Assumptions A3 and A4 hold, and 

Qiffl)-Qt-l&t-i)- Then 



suppose that J- and Q both satisfy Assumption A2. Denote Yp 



nty-i [Y e n ] < C kJjP0 I M,}' d+5 + R e l[^°) , for any 5 > 0. 



-1/2+5N 



To prove Lemma 4.5 we first need some preliminary results. Firstly, note that for Q™^ 1 ^ 9"^} 
to be the case, the proposal on level I at state n + 1 had to be rejected. Given the proposal 9' e 
and the previous state 6™, the probability of this rejection is given by 1 — a l {9' l \9™'). We need to 
quantify this probability, and this leads to the following crucial result. 



Theorem 4.6. Suppose T satisfies A2 and A3 and A4 hold. Then 

lim a l (9'^ I O'l) = 1, for Vi — almost all 9'p, 9". 

Furthermore, 



(1 



a 



l/d+5 D -l/2+«5\ 



■ 



for any q < 00 and 5 > 0. 

Proof. We will first derive a bound on 1 — a (9{ | 9'I), for £ > 1 and for 9'^ and 9" given. First note 
that if ^(g/^/) w i-i(o'' C ) — 1> then 1 — a | #") = 0. Otherwise, we have 



a: 



At: 



7T 



< 



1 



( e 't,c 



+ 



TT 



£-11 



'LC> 



+ 



1 



'n i (8' e )7r e - 1 (9'l 




TT 



£ran\ 



TT' 



-\9» 



LC> 



TT 



'- 1 (8" 



1<G> 



(4.6) 



respectively, we have 



Let us consider either of these two terms and set 9^ = (£j)^Li to be either 9'^ or 9". Using the 
definition of it in (2.1), as well as the models (4.4) and (4.5) for the prior and the likelihood, 

Ct(F o ha \8e) 
bs\9e,c) 



Ah 



V £ (9 e 



^-\0i,c) " Ve.^c) £e~i(F ohs \9 e ,c) 



(4.7) 



exp 



2vr 



E 



\F, 



obs 



Ft 



+ 



ll^obs - F t . 



-1\?LC 



j=R e _ 1 +l 



<TT 



O'l 
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Denoting Fg := F(9g) and := F(9e,c), an d using the triangle inequality, we have that 

( !. ^<l- - f i-J !. + L f / - i-t-\ I! 



^obs 



^11' 



I -^obs — f^-lll 2 



I -fobs — Fl-l\ 



a 



Fl-l 



<T7 



a 



F,l-l 



obs 



Fi-i\\ 2 (a 



F,£ 



a 



F,t-l 



2||F obB -F / _i || + 11^-^-1 1| 

+ o W-fe — fit-i\ 



a 



F.t 



Since T was assumed to satisfy A2, it follows from the theory in [5l[3Q] (for the particular covariance 
function C(x,y) in (4.3) with r = 1) that 



\F £ - Fi-xW < C ke j >po (0e) (jk£ - fe^-iUcro 



(D) 



+ M, 



-l/d+S 



for almost all 9g and for a constant C kl j p (9() < oo that depends on 0£ only through kg := 
ex P {^Zf=x \/V^ ( t ) 3^jj- Since ||i 7 £_i|| can be bounded independently of £, for almost all Og (again 
courtesy of Assumption A2), and since ||-F bs — -ff-ill — II -fobs 1 1 + l|f)-i||> we can deduce that 



ll^obs - f>|| 2 \\F ohs -F e ^ 2 



a 



Fl-l 



^ Ck e j, PO (0e) ( (v F>t -° F ,i-i) + II** ~ h-x\\c°(D) + M > 



-l/d+6 



Finally, substituting this into (4.7) and using the inequality |1 — exp(x)| < |x| exp |x| we have 

n ^ < C ki j, P0 (9 £ ) ((2^)- Ri ^Ce + - <Tp}-i) + \\h - h-x\\ cm + M[ 1/d+s 

(l.S) 



n l ~ 1 (9 t n) 



for almost all 9%, where Q := Ylf=R t x +i £f> i- e - a realisation of a x 2 -distributed random variable 
with Rf — Ri-i degrees of freedom. 

Now as £ — > oo, due to Assumption A3 we have Ri — > oo and (2ir)~^ Rt ~ Rt ~ 1 '' 2 Ct — > 0, almost 
surely. Moreover, Mi — > oo and it follows from [3 Prop. 3.6 & §7.1] that \\kg — fef-illeooj) — > 0, 
almost surely. Hence, using also A4 we have 



lim 

t— -too 



0, for almost all 9p. 



The first claim of the Theorem then follows immediately from (4.6). 

For the bound on the moments of 1 — ap, we use that all finite moments of Ck e j )P0 (9g) can be 
bounded independently of £ (cf. [30]). It also follows from [5J Prop. 3.11 & §7.1] that 



\k e - k^i ll<J 



C°(£>) 



^ Re , for any o > (J, q < oo. 



Finally, since Q is ^-distributed with Rg — Ri-\ degrees of freedom, we have 



HWC] = 2" 



T(\(Rg- Rg^+q) 

r(±(Rt-Rt-ij) 



< (R e - Rg.-i) q , for any 5 > 0, q < oo. 



Thus, the bound on the qth moment of 1 — ag follows immediately from (4.8), Assumptions A3 
and A4 and Holder's inequality. □ 



We will further need the following result. 
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Lemma 4.7. For any 9e, let ke(9e) := exp ( Ylf=i yffij4>j£j) an d K (9i) := m ^ n x eD ^('' x )- Then 



\n{»t) - rift 



for almost all 9i,9' e 



and 



E 



V e ,V e 



V<7 



< constant, 



\pe(0e) - Pe{0'e)\ Q H i {D) 
for any q < oo, where the hidden constants are independent of £ and pn. 
Proof. Using the definition of k(9^), as well as the identity 

/ ki(9t)Vpi(0 t ) -Vvdx= [ fvdx= [ k e (fi) Vpe(0' e ) ■ Vvdx, for all v G H%(D), 
Jd Jd Jd 

(deduced from (|4.2|)) we have 



(4.9) 
(4.10) 



<0i)\p. 



pM)\m(D) < 



D 



kt(0 { ) V (p e (0 e ) - P i(6' e )) • V ( Pe (0 t ) - p £ (9' e )) dx 

< I (k t (9 e ) - kM) VpM) • v (p e (e e ) -pM)) dx. 



D 



Due to the standard estimate \pi(9i)\h 1 (d) i$ \\f\\H- 1 (D)/ K (^i) this implies (4.9) 
It follows from Prop. 3.10] that E Ve [«(0/)-«] and Ep £i ^ ||^(^)||J ^ 
independently of £. The result then follows from an application of the Minkowski inequality to 

1/9 



can be bounded 



E 



\\ke(0t) — /c^(^) 11^0(75) > as wen as Holder's inequality. 



□ 



Using Theorem 4.6 and Lemma 4.7, we are now ready to prove Lemma 4.5 



Proof of Lemma 4-5 Let Of and 0"_i be the nth states of the Markov chains generated by Algo- 
rithm 2 on level I. It follows from Lemma 4.4 and the fact that Y^fX] < E 7r [X 2 ], for any random 
variable X and any measure tt, that 



V^-i [Qtffl) - Qe-i(@ti)] < ^Vi,Vi-i {Qe(0e) - Qt-x^-i))'' 



(4.11) 



Now, to simplify the presentation let us set 9 := 0", 6c = 9™ c and 9p = 9™ F , and denote by 
9' = 9'f the proposal generated at the nth step of Algorithm 2 with 9' c = 0?_i and with some 9' F . 
Note that 9' ^ 9 only if this proposal has been rejected at the nth step. It follows from (4.11) by 
the triangle inequality that 



[Qe(9) - Qi 



< E 



Vi,T t 



{Qi{9) - Qt{9')f +® Pt ,v t _ 1 {Qt{9') - Qt-i{9' c )) 



A bound on the second term follows immediately from Lemma 4.3 



i.e. 



(Qi(9')-Qi-i(0' c )) 2 } < C kJ>Po (Mi 2/d+S + R 



-1+5 



(4.12) 
(4.13) 



The first term in (4.12) is nonzero only if 9 ^ 9'. We will now use Theorem 4.6 and Lemma 4.7 



as 



well as the characteristic function I^g^giy G {0, 1} to bound it. Firstly, Holder's inequality gives 



{Qi(9) - Qi(9')Y 



{Qi{e)-Qt(9')) 2 \e^} 



(Q e (9) - Q e (9')) 



1/92 



(4.14) 
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for any q±,q2 s.t. q 1 1 + q 2 1 = 1. Since the functional Q was assumed to be li near and bounded 



4.7 



that the term 



on H^ijT) C H l l 2 ~ 5 , for all 5 > (Assumption A2), it follows from Lemma 
^Pt,Vt [ (Qt(Q) ~ Qe(G')) 2qi ] in (4.14) can be bounded by a constant independent of £, for any 
qi < oo. Moreover, using the law of total expectation, we have 



E 



V e ,V e I 1 {9^6'} 



[e^e'\ e, e'] 



Since 9^9' only if the proposal 9' has been rejected on level I at the nth ste p, th e probability 
that this happens can be bounded by 1 — a e (9'\9), and so it follows by Theorem 



4.6 



that 



< M7 l ' d+S + R- l ' 2+S 



Combining (4.12)-(4.15) the claim of the Lemma then follows. 



(4.15) 

□ 



We now collect the results in the preceding lemmas to state our main result of this section. 



Theorem 4.8. Under the same assumptions as in Lemma 4-5, the Assumptions Ml and M2 in 
Theorem 3.5 are satisfied, with a = (3 = 1/d — 5 and a' = f3' = 1/2 — 5, for any 5 > 0. 

If we assume that we can obtain individual samples in optimal cost Ci < hJ d log(hJ 1 ), e.g. via 
a multigrid solver, we can satisfy Assumption M4 with 7 = 1 + 5, for any 5 > 0. Then it follows 
from Theorems 3.5 and 4.8, as well as equation (2.11), that we can get the following theoretical 
upper bounds for the e-costs of classical and multilevel MCMC applied to model problem (4.2) 
with log-normal coefficients k, respectively: 



Ce(Q 



MC\ 
N ) 



< -(d+2)-8 



and C £ {Qf 



ML \ 



for any 5 > 0. 



(4.16) 



We clearly see the advantages of the multilevel method, which gives a saving of one power 
of e compared to the standard MCMC method. Note that for multilevel estimators based on 
i.i.d samples, the savings of the multilevel method over the standard method are two powers of 
e for d = 2,3. The larger savings stem from the fact that (3 = 2a in this case, compared to 
f3 = a in the MCMC analysis above. The numerical results in the next section for d = 2 show 
that in practice we do seem to observe (SpsIw 2a, suggesting C £ (Q^^ N ^) = 0{e~ d ). However, 
we do not believe that this is a lack of sharpness in our theory, but rather a pre-asymptotic 
phase. The constant in front of the leading order term in the bound of V^f n e-i [Y™], namely the 

, depends on the difference between Qi(9f) and 
Qe{9g). In the case of the pCN algorithm for the proposal distributions q > and q^ ,F (as used in 
Section [5] below) this difference will be small, since 9 and 9' will in general be very close to each 
other. However, the difference is bounded from below and so we should eventually see the slower 
convergence rate for the variance as predicted by our theory. 



term &p t ,V t [ {QlWi) ~ QWl)) ] in < 4 - 14 



5 Numerics 

In this section we describe the implementation details of the MLMCMC algorithm and examine 
the performance of the method in estimating the expected value of some quantity of interest for 
our model problem (4.2 ). We start by presenting in Algorithm 3 a simplified version of Algorithm 2 
given in Section 3 using symmetric proposal distributions for q^' C and q e ' F , describing in some more 
detail the evolution of the multilevel Markov chain used to approximate ^-lfY^]. 
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ALGORITHM 3. (Simplified Metropolis Hastings MCMC for Y e , i > 0) 

Choose initial states and 99. For n > 0: 

• On level £- 1: 

- Given 0"_ x , generate G^_ x from a symmetric distribution g £,c '(0^_ 1 |0^_ 1 ). 

- Compute a e > C (@Um-i) = min |l, ^|^] | • (5-1) 

_ Set e n+i = J ^-i with probability ^(G^G^) 

\ G™_ 1 with probability 1 - a e ' C (Q'^Q^) . 

• On level I: 

- Given let 0' iC = 0£lf and draw ^ F from a symmetric distribution q e ' F (9' e F \0™ F ) 



-C„ m p Ut e ^.^.M&L^j. (5 „ 

_ Set r+ l = f ^ = I ?-i > e 'lM with probability o«(0J|0?) 

6 1 [Of with probability 1 - o*(0£|0?). 

Compute = Qe (O - Qe-i (G^ 1 ) (5.3) 



5.1 Implementation Details 

Given the general description of the multilevel sampling in Algorithm 3, it remains to describe 
several computational details of the method, such as the choice of the symmetric transition prob- 
abilities q (G^_l|G£_i) and q £,F (9' e F \0£ F ), the values Re defining the partition of the KL modes 
over the multilevel hierarchy, as well as various MCMC tuning parameters. 

For all our symmetric proposal distributions q e,c and q e,F , I = 1, . . . ,L, we use the so-called 
precondition Crank-Nicholson (pCN) random walk proposed by Cotter et al. in [TT]. Given the 
current state 9 n , the j th entry of the proposal is obtained by 

#, = ^1-0*0? + (5.4) 

where £j ~ N (0, 1) and f3 is a tuning parameter used to control the size of the step in the proposal, 
that may be chosen level dependent, i.e. (3 = jig. In the numerical experiments, we typically choose 
&<A>fbr*=l,...,L. 

The other free parameters in Algorithm 3 are the parameters uL found in the likelihood model 



described in (4.5). The value of a Fi controls the fidelity with which we require the model response 
to match the observed data on level I. In our implementation we fix the fine-level likelihood variance 
cr FL to a value consistent with traditional single level MCMC simulations (i.e. the measurement 
error associated with F b s in a practical application), and then allow the remaining parameters to 
increase on coarser levels. This is done for two reasons. First, because the coarse simulations do 
not include all stochastic modes of the model, and so the coarse approximation will not necessarily 
agree exactly with the observed data. Second, since the coarse approximations necessarily include a 
higher level of discretisation error, it makes sense to relax the restrictions on the agreement between 
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the model response and the observed data. Due to the consistency of the multilevel estimator, the 
choices of cr F p, £ < L, will only influence the overall cost of the estimator and not the bias. The 
particular values used in the presented numerical experiments are chosen so that the values of 
likelihood variance increase with the characteristic mesh size in the hierarchy. Specifically we fix 
the value of a FL on the finest grid, and then set 

o 2 F/ = (I + nhi) a 2 F/+l , e = Q,...,L-l, (5.5) 

where hi is the mesh size on level I and k is a tuning parameter. This choice ensures assumption A4. 

To reduce dependence of the simulation on the initial state of the Markov chain, and to aid 
in the exploration of the potentially multi-modal stochastic space, we simulate multiple parallel 
chains simultaneously. The variance of the multilevel estimator VqJY^ 1 ^] is approximated on 
each grid level by sj N using the method of Gelman and Rubin |15| . Finally, due to the very high- 
dimensional parameter space in our numerical experiments, both the single-level and multilevel 
samplers displayed poor mixing properties. As such, we use a thinning process to decrease the 
correlation between consecutive samples, whereby we include only every T th sample in the approx- 
imation of the level-dependent estimator, where T is some integer thinning parameter [27] . Then, 
after discarding no initial burn-in samples, the approximation of K n i n i-i [Ye] is computed by 



n +N e 

yMC ._ 1 \^ v (nT) 



n=no 



After the initial burn-in phase, the multilevel MCMC simulation is run until the weighted sum 
of the estimators from the L + 1 grid levels satisfies 

y ^ < - (5.6) 

e=o 1 

for some user prescribed tolerance e. The total cost of the multilevel estimator is minimised when 
the number of samples on each level is chosen to satisfy 

Ni cx y/v^i-i [Y] /C e « yfifa/Ct, (5-7) 

as described in , where Cg is the cost of generating a single sample of Yg on level i. We assume 
this cost to be expressed as 

C t = C* V JMJ, (5.8) 
where the constant C* may depend on the parameters a 2 and A in (4.3), but does not depend on 



t. The factors rj£ reflect the additional cost for the auxiliary coarse solve required on grid £ — 1. 
For the experiments presented below, with geometric coarsening by a factor of 4, we have 770 = 1 
and rj£ = 1.25 for j = 1, . . . , L. When an optimal linear solver (e.g. algebraic multigrid) is used to 
perform the forward solves in the simulation we can take 7 ~ 1. For a given accuracy e, the total 
cost of the multilevel estimator can be written as 

L 

Cs(qTwa) ^TANe. (5.9) 



1=0 
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5.2 Numerical Experiments 



We consider (4.2) defined on the domain D = (0,1) with / = 1. The boundary conditions are 



taken to be Dirichlet on the lateral boundaries of the domain, and Neumann on the top and bottom: 



p\ Xl =0 = 1) P\x r - 



0. 



dp 

dn 



0. 



x 2 =0 



dp 
dn 



0. 



SD2=1 



The quantity of interest we approximate in our numerical simulation is the flux through the 
''outflow" part of the boundary, given by 



jout 



1 k— 
u dx 1 



dX2- 



(5.10) 



X\ =1 



The (prior) conductivity field is modelled as a log-normal random field with covariance function 
(4.3) with r = 1. The "observed" data i^bs is obtained synthetically by generating a reference 



conductivity field from the prior, solving the forward problem, and evaluating the pressure at 9 
randomly selected points in the domain. The domain is discretised using piecewise linear finite 
elements on a 2D uniform triangular mesh. The coarsest mesh contains mo = 16 grid points in 
each direction, with subsequently refined meshes containing rag = 2^mo in each direction, with the 

Five parallel chains are used in each 



m 



£• 



total number of grid points on level I defined as 
multilevel estimator. 

The top two plots in Figure [l] show the results of a four-level simulation with A = 0.5, a 2 = 1, 
and mo = 16. The partition of the KL modes was such that Rq = 96, R\ = 121, R2 = 153, and 
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i?3 = 169. Five parallel chains were used for each level-dependent estimator. The fidelity parameter 
in the likelihood on the finest grid was taken to be c%t = 10~ 4 . The fidelity parameters on the 
other levels were obtained by equation (5.5) with k = 1. The simulation was stopped when the 
variance of the multilevel estimator reached e 2 /2 with £ = 8x 10~ 4 . The top left plot compares the 
variance of quantities Qg and Yg on each level. The top right plot compares the mean of quantities 
Qg and Yg on each level. The plots for Yg seem to decay with 0{h 2 ) and O(hg), respectively. This 
suggests that at least in the pre-asymptotic phase our theoretical result on the variance which 
predicts O(hg) (in Theorem 4.8) is not sharp (see comments at the end of Section [4|. The result 
on the bias seems to be confirmed. 

The bottom two plots in Figure [T] show the number of samples Ng required on each level of 
the multilevel MCMC sampler and compare the computational cost of the standard and multilevel 
MCMC samplers for varying values of accuracy e, respectively. Note that for larger values of e 
fewer grid levels are required to attain a reduction in variance equivalent to the spatial discretisation 
error. The total cost of the simulation is given in terms of the cost of one forward solve on the 
coarsest grid (which is the same in each case). The y-axis is scaled by e 2 . It is clear that the 
multilevel sampler attains a dramatic reduction in computational cost over the standard MCMC 
sampler. The precise speedup of the multilevel over the standard algorithm can be evaluated by 
taking the ratio of the total cost of the respective estimators, as defined by (5.7)-(5.9). When an 
optimal linear solver (such as AMG, with 7 ~ 1) is used for the forward solves in the four- level 
simulation with e = 8 x 1CP 4 (as in the top plots of Figure [l]), the computational cost of the 
simulation is reduced by a factor of 50. When a suboptimal linear solver is used (say, 7 « 1.5 for 
a sparse direct method) the computational cost is reduced by a factor of 275. 

Figure ([2]) (left) confirms that the average acceptance rates ag of the fine-level samplers - the 
last three dots in Figure @ (left) - tend to 1 as I increases, and E[l — ag] ~ O(hg), as predicted in 
Theorem 4.6. Finally, the results in Figure ([2]) (right) demonstrate the good agreement between the 
MLMCMC estimate QY\nA anc * tne stan dard MCMC estimate Q^ c of the quantity of interest q out 
for nine distinct sets of reference data with three levels of fine-grid resolution. As before, the coarse 
grid in each case was defined with tuq = 16, the tolerance for both estimators was e = 8 x 10 -4 and 
the model for the log-normal conductivity field is parametrised by A = 0.5, a 2 = 1 and Rl = 169 
on the finest grid. 
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